Floria: fast and accurate strain haplotyping in metagenomes

Abstract Summary Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remains a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes show that Floria is  > 3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry) while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 min on average per sample and identified several species that have consistent strain heterogeneity. Applying Floria’s short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. Availability and implementation Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow along with code for reproducing the benchmarks.


Introduction
To accurately assess the full genetic potential of microbial communities and capture their evolutionary and ecological dynamics, it is often necessary to resolve genomes at the strain level (Goyal et al. 2022).This is because there can be significant phenotypic variation between different strains of the same species (Pierce andBernstein 2016, van Opijnen et al. 2016), e.g.Escherichia coli being either pathogenetic (Leimbach et al. 2013) or probiotic (Sonnenborn 2016) in a strain-dependent manner.Thus, disambiguating strains within human microbiomes has important implications for human health and precision medicine (Fukuda et al. 2011, Federici et al. 2022).Yet, many widely used metagenomic workflows (Uritskiy et al. 2018, Moss et al. 2020) are not designed to recover multiple highly similar strain genomes.This often results in assemblies where less abundant strains are not recovered, potentially missing out on ecologically and medically important genetic features of the microbial diversity present.
Computational strain recovery comes in different forms.Strain haplotyping (or phasing) is the recovery of alleles, such as SNPs, that co-occur along the same chromosome (Browning and Browning 2011), or in the case of haploid microbes, the same strain in a community (haplotype) (Nicholls et al. 2021).We are interested in using read overlap information to link alleles along the same strain haplotype.In this case, by clustering reads into strain-level clusters, one can also recover the sequence of alleles for each strain (Bonizzoni et al. 2016).On the other hand, strain-level assembly is the recovery of all genomic content for each strain by base-level de novo assembly (Feng et al. 2022, Benoit et al. 2024).An orthogonal approach is strain-level profiling, where reference genomes from a database are used to identify corresponding strains from a metagenome (Francis et al. 2013, Dilthey et al. 2019, van Dijk et al. 2022).Profiling assumes the existence of a reference genome for each strain in the population and does not reconstruct haplotypes explicitly.In this work, we will focus instead on phasing and assembly.
Short reads are limited in their ability to resolve betweenstrain similarities required to construct high-contiguity strain-level assemblies, but the advent of long-read Oxford Nanopore and PacBio sequencing has opened the door for more complete strain resolution.In particular, PacBio HiFibased assembly methods leverage highly accurate, but more expensive, long reads for strain-level assemblies (Feng et al. 2022, Benoit et al. 2024), but HiFi reads are often not available for population-scale cohorts due to its cost.Therefore, methods that work well on long-reads with higher error rates, as well as short-reads, are still desirable despite their inherent limitations.
For nanopore reads, strain-level assembly is a challenging task at lower coverages or when using older, less accurate chemistries.Even with the Oxford Nanopore's newest R10.4 flow cell, 40× coverage is still recommended (Sereika et al. 2022).Strain-level assembly is also computationally expensive for all technologies.In contrast, because phasing requires resolution of only a sparse subset of alleles, it is more feasible for sequencing data with lower accuracy, while being more computationally efficient than assembly based approaches.Thus, phasing is a very useful task when strain-level assembly may give low-quality contigs or is too time-consuming.However, most existing haplotyping methods have only been designed and tested for short-reads [e.g.DESMAN (Quince et al. 2017), ConStrains (Luo et al. 2015), Gretel (Nicholls et al. 2021), EVORhA (Pulido-Tamayo et al. 2015), BHap (Li et al. 2019)].A few long-read strain-level assemblers with built-in haplotyping capabilities exist, but they have yet to be applied to large cohorts with deep metagenome sequencing data [e.g.Strainberry (Vicedomini et al. 2021), Strainy (Kazantseva et al. 2023)].
In this work, we introduce Floria, a novel strain haplotyping algorithm that can take both long and short-read metagenomic datasets as input.With Floria, the haplotype phasing task takes only minutes on a standard workstation, an order of magnitude faster than assembly, enabling large-scale recovery of microbial haplotypes.Additionally, our package Floria-PL uses the Floria haplotyper to provide a onecommand strain-level assembly pipeline for complex metagenomes that is more accurate and several times faster than existing methods.

Materials and methods
Floria takes a set of mapped reads and SNPs and outputs a clustering of the reads into strain-level sets (Fig. 1B).Each resulting set of reads output by Floria (called a haploset) is a group that is sequenced from the same strain.The consensus sequence of SNPs for a haploset (called a vartig) is also output, along with an estimate of the strain count.Pseudocode is available in Supplementary Methods.Floria's phasing algorithm is written in Rust, a systems-level programming language, for speed.
To facilitate Floria with all possible inputs, Floria has also been packaged into an end-to-end snakemake pipeline called Floria-PL (Fig. 1A).Reads can be either sequentially assembled and mapped or analyzed via a classification approach to map reads to a sample-specific set of reference genomes.Mapped reads, whether to assembled contigs or reference genomes, are then used to call variants and processed by Floria's core algorithm.The pipeline can use the output haplosets to generate phased assemblies (i.e.assembled haplosets) with different assemblers.

Preliminaries and MEC optimization
We first review the problem of polyploid haplotype phasing of a k-ploid organism (Shaw and Yu 2022), analogous to phasing k strains.We represent a read in variant-space as an element of the set f−; 0; 1; 2; 3g m where m is the number of SNPs present in the contig.Let r[i] be the allele of the ith SNP in the read, where we set r½i� ¼ − if the read does not cover the ith SNP, r½i� ¼ 0 for the reference allele, r½i� ¼ 1 for the first alternate allele, and so forth.Paired-end reads are combined into one element in variant-space.We only use SNPs in Floria.
We represent the mapped reads as a set R ¼ fr 1 ; r 2 ; . . .; r n g.One formulation of phasing is to partition R into k sets R 1 ; ; R k where the set of reads R i contains the reads coming from the ith haplotype, and this problem is commonly solved by optimizing the MEC model: given a partition of reads, the MEC score is the minimum number of alleles that must be changed in order to have an error-free phasing.
To define the MEC objective precisely, we first define dðr i ; r j Þ as the number of non "-" alleles that differ between two reads and sðr i ; r j Þ as the number of non "-" alleles that are the same between two reads.We define HðR i Þ to be the consensus haplotype of a set of reads: HðR i Þ is also an element in f−; 0; 1; 2; 3g m where HðR i Þ½j� ¼ a if the majority of reads have allele a at position j.If two alleles are equally supported, we can choose any of the majority alleles.We let HðR i Þ½j� ¼ − if no read covers the jth allele (e.g.due to deletions).Then given a partition R 1 ; :::; R k of R, the MEC score is The MEC minimization problem is to find the partition that minimizes the MEC score; this turns out to be an NPhard problem (Lancia et al. 2001).Thus, we solve it using a heuristic approach we describe below.

Local MEC optimization through beam search
As described in Fig. 1B, step 1, Floria first segments the contig into overlapping, contiguous regions of the genome.All reads that overlap each region is termed a block.Blocks are simply sets of reads, but blocks are not necessarily disjoint.The default block length, the length of the contiguous region corresponding to the block, is the 66th percentile of the read length, but it is adjustable by the user.The blocks overlap by 1/3 of the block length.In the subsequent sections, the read set R corresponds to all reads overlapping a specific block.
We first describe our method for a fixed number of strains, k.We optimize the MEC score based on the heuristic beam search algorithm, variants of which have also been applied to polyploid haplotype phasing (Xie et al. 2016).Our beam search routine is as follows: we start off by sorting the reads in R based on the first SNP the read covers and initialize a set of candidate solutions, Solutions ¼ P, containing a single partition of k empty sets.We then iterate through every read r i in order.During every iteration step, we create a new set of candidate partitions by adding r i to every set in every partition in Solutions, giving us k � jSolutionsj candidate solutions.We then throw away identical partitions and keep the γ lowest MEC score solutions out of all k � jSolutionsj candidate solutions, where γ is some parameter.We repeat this procedure until we use all reads and then return the lowest MEC score partition in the final set of candidate solutions.By default, we keep γ ¼ 10 candidates.

Optimizing the beam search
We introduce two additional heuristics beyond the basic beam search.Firstly, for the first 25 reads, we use γ 0 ¼ k � γ solutions instead to ensure a good initial phasing.Secondly, we use a new probabilistic heuristic to trim poor solutions.We will define D ɛ ðr i ; HðR j ÞÞ to be the exponentiated Kullback-Leibler divergence between the read r i and the consensus haplotype HðR j Þ under an allele mismatch error probability ɛ as follows.First, let a ¼ dðr i ; HðR j ÞÞ dðR i ; HðR j ÞÞ þ sðr i ; HðR j ÞÞ : Here, a represents the fraction of different alleles from the consensus.Then, we define where n ¼ dðr i ; HðR j ÞÞ þ sðr i ; HðR j ÞÞ is the total number of SNPs shared by the haplotype and the read, and T is a constant that can be thought of as a "temperature" parameter that controls for the strictness of this divergence.By default, we set T ¼ 0:25.D ɛ will be small when the read r i is dissimilar to the consensus haplotype, and can also be rigorously interpreted as a binomial test P-value in a limiting regime [Theorem 2 in Arratia and Gordon (1989)].Notably, the exponentiated divergence is minimized when a ¼ ɛ, but we want it to be increasing when a is small, so Floria flips the sign when a < ɛ in the same manner as in a previous method (Shaw and Yu 2022).We use D ɛ ðr i ; HðR j ÞÞ to prune unlikely sets to add a read to during beam search, thus tightening the search space to more plausible solutions.We define the relative goodness of the read assigned to set R j in partition P as Dɛðri;HðRjÞÞ P R2P Dɛðri;HðRÞÞ .We then put r i into the set R j as a putative solution in our beam search only if the relative goodness >0.01.
Given the output of the beam search, we describe in Supplementary Sections S1-S3 in the Supplementary Methods an additional iterative optimization step and how to augment the MEC score with information from the ɛ parameter.

MEC ratio as a threshold to detect local strain count
The above beam search algorithm outputs a partition for a fixed strain count and a single block, but it requires prior knowledge of the number of strains present.To find the number of strains present, we use an iterative heuristic that checks how the MEC score decreases as we increase a putative strain count; a simple probabilistic model of strain phasing (Supplementary Methods S4) was used to inform our heuristic.Note that two strains may look almost identical in a local block.Therefore we will denote the number of strains that are distinct in this local block as the local strain count.
Let MEC � ðkÞ be the MEC score after partitioning R into k sets from our MEC optimization procedure.Let αðkÞ : f1; 2; 3; . ..g !ð0; 1Þ be a decreasing function of k, where we require ð1−ɛÞð1 þ αðkÞÞ > 1 for all k.The algorithm for determining the optimal strain count is as follows: we phase iteratively for increasing putative local strain count k ¼ 2; 3 . . .and check if MEC � ðkÞ=MEC � ðk−1Þ ≥ 1 ð1−ɛÞð1 þ αðkÞÞ holds, i.e. the MEC score is still relatively large after increasing the putative strain count.If inequality holds, we stop phasing and return k−1 as our true local strain count.We let αðkÞ ¼ 1=½k þ 1=3� by default and three additional choices depending on the desired resolution (Supplementary Methods S4).The resulting haplosets contain reads with concordant SNPs but may correspond to multiple similar strains (blue and green).

Flow graph construction on local haplotype blocks
We run the above iterative heuristic for all N blocks over the contig, obtaining a set of partitions P 1 ; P 2 ; . . .; P N with low MEC scores.We then construct a directed acyclic graph (DAG) (Fig. 1B, step 3) and a network flow capturing the flow of coverage between strains across blocks (Fig. 1B, step 4).Let P i ¼ fv 1 i ; v 2 i ; . . .; v ki i g where v j i is the set of reads for the jth local strain in the ith block with local strain count k i .We let fv j i g i;j ¼ V be the set of vertices in the graph and construct a directed edge between v j i !v j 0 i þ 1 for all i; j; j 0 .We weigh the edge by the number of reads shared between v j i and v j 0 i þ 1 and remove the edge if the number of reads shared is <2.
This construction thus far does not describe a flow since the in-flow equals out-flow condition is not necessarily satisfied.We wish to find a reweighing of the edges that is faithful to the edge weights, yet is a true flow.Such a weighting gives a globally aware and thus robust signal that is not affected by local coverage variation or noisy reads being incorrectly partitioned (Fig. 1B, step 4).This problem is similar to flow problems in transcriptome assembly, or viral quasispecies assembly (Tomescu et al. 2013, Shao and Kingsford 2019, Baaijens et al. 2020), but we use a different objective function and path retrieval algorithm.

Exactly solving the flow problem as a linear program
Mathematically, given the graph as described in the previous section with G ¼ ðV; EÞ and weighting w, we wish to find a flow f on the edges that solves min f P e2E jwðeÞ − f ðeÞj where f also has to satisfy the standard flow constraints: for all v 2 V, we require P e2vin f ðeÞ ¼

Strain recovery by vertex-disjoint path decomposition
After obtaining a flow f on our DAG G, we iteratively extract vertex-disjoint paths from source to sink with the largest minimum flow (Fig. 1B, step 5), removing each path from the graph after extraction.For each iteration, this is found in time OðjVjjEjÞ by a standard dynamic programming procedure along a DAG with traceback.We chose the largest minimum flow criterion because it is robust to coverage variation within the strains and high-coverage regions caused by mobile elements.Finally, we take the union of all reads within each path and output haplosets (Fig. 1B, step 6).For each final haploset, we output the reads in the haploset and the consensus sequence of SNPs over all reads in the haploset, which we call the vartig.Since some reads may still be shared across haplosets, we assign each read, r, to only its best haploset by finding the haploset, Hap, that minimizes dðr; HðHapÞÞ.
Resulting phasings may be fragmented (Fig. 1B, step 5), but due to the complexity of real metagenomes, we opted for a conservative vertex-disjoint path decomposition method as opposed to an overlapping path decomposition method (Shao and Kingsford 2019) that is sometimes used for viral or transcriptome assembly.We initially tried using the standard greedy method of iteratively extracting paths with the largest flow and subtracting their flows to get an overlapping path decomposition.However, we found that we would get many duplicate vartigs due to overlapping paths forming from leakage of flow due to noise.

Estimating strain count and haplotype phasing quality
For each contig phased by Floria, we output a statistic called the average strain count, or just strain count for short (not to be confused with local strain count).The strain count of a contig is the average number of haplosets covering each SNP.Unlike the number of strains in a community, strain count can be fractional, e.g. if two strains are identical for half of their genome but contain variation on the other half, the strain count is 1.5.Strain count thus represents a continuous estimate of the heterogeneity of the contig.We also output haplotype phasing qualities (HAPQ), analogous to mapping qualities, MAPQ, in read mapping.We describe our formula and interpretation of HAPQ in Supplementary Methods S5 and its application for strain count estimation in Supplementary Methods S6.

Floria-PL: a complete pipeline for haplotype phasing and assembly with Floria
Floria requires aligned reads and variants.To facilitate a phasing from reads directly, two approaches are implemented in the Floria-PL pipeline (Fig. 1A): users can input a metagenomic assembly or reads can be classified against a Kraken database [UHGG in this work (Almeida et al. 2019) v2.0.1] to retrieve high-quality genome assemblies from species with an estimated coverage high enough for phasing (default 5X).We show the efficacy of this approach by retrieving synthetic spiked-in reads from two close Escherichia species within an Escherichia depleted gut sample (Supplementary Fig. S1, Supplementary Methods S7), and for the synthetic dataset described below by retrieving reads with high accuracy (Supplementary Fig. S2).Reads are mapped using minimap2 (Li 2018) against the selected reference genomes.For variant calling, both Longshot (Edge and Bansal 2019) (v0.4.1) and freebayes (Garrison and Marth 2012) (v2.11.0) are incorporated in Floria-PL.Additionally, haplosets output by Floria can be further processed to generate assemblies, and multiple assemblers were implemented within the pipeline: Megahit (Li et al. 2015) (v1.2.9), Flye (Kolmogorov et al. 2020) (v2.7-b1585, kmer size ¼ 16) and WTDBG2 (Ruan and Li 2020) (v2.5).We further extended this approach to produce a comprehensive assessment pipeline to systematically compare phasing solutions (Supplementary Fig. S3) covering read simulation, different mapping and variant calling tools, and a phased contigs assessment module.

Synthetic communities generation and benchmarking
To produce a realistic strain synthetic community, we selected the 40 most abundant species in a cohort of 109 Singapore gut metagenome samples (Gounot et al. 2022), based on the median abundance obtained from Kraken classifications against the UHGG database.For each species, only genomes with completeness >90% and contamination <5% were kept and sorted using the score: Completeness − 5 × Contamination þ 0.05 × log(N50).The 500 strains with the highest score for each species were retained and compared using skani (Shaw and Yu 2023) (v0.1.0).The resulting pairwise ANI distances were processed to generate strain clusters using an agglomerative clustering (scikit-learn v1.2, 1% ANI threshold, average linkage), and genomes with the highest score within their cluster were assigned as the strain cluster representative.To explore a large panel of strain numbers, the synthetic community was generated by putting together differing numbers of strains for different species, stratified into five groups (number of species/number of strains): 5/1, 10/2, 10/3, 10/4, and 5/5.Genome coverage was defined randomly with a minimal coverage of 5× and a maximal coverage of 25× using a discrete uniform distribution.Nanopore reads were simulated with badread's (Wick 2019) nanopore2020 model at 87.5% mean identity.Description of other synthetic communities can be found in Supplementary Methods S8.Phased assembly results generated from samples with known isolates were assessed in a similar fashion to Strainberry's methodology (Vicedomini et al. 2021).Briefly, query assemblies were compared against each reference strain's genomes using MUMmer4 (Marc¸ais et al. 2018), and each contig was assigned to the closest isolate using a coverage × identity score.Only alignments covering at least 50% of their query were considered.Aligned contigs were concatenated and compared all together against their assigned true genome with MUMmer4 to produce GAGE (Salzberg et al. 2012) assembly evaluation metrics, including genome coverage and identity, percent aligned, duplication ratio, and potential structural variants.We primarily benchmarked against Strainberry and metaFlye.Strainy (Kazantseva et al. 2023) is another recently developed long-read phasing tool, but it is designed for phasing "one or a few bacterial species" currently and extending to larger metagenomes is a "work in progress" according to its latest GitHub commit (hash number 9d6a2d5), so we did not benchmark against it in its current version.Benchmarking of CPU and memory usage is further described in Supplementary Methods S9.

Floria improves recovery and runtime for longread phased assembly of diverse strain mixtures
We first compared Floria to MetaFlye and Strainberry on a 40 gut species (120 strains) synthetic community (see Section 2), using the automatic reference genome identification approach (Fig. 1A, option 1) and Flye as Floria's downstream assembler.Floria improves phased assembly results (Fig. 2A, Supplementary Fig. S4), with a significantly larger portion of strain genomes recovered (21% relative mean improvement, Mann-Whitney U Test, P-value <1 × 10 -6 , Supplementary Fig. S5A), higher contiguity compared to Strainberry (Fig. 2B, Supplementary Fig. S5B) and higher completeness, with an assembly size on average 23% larger (mean ¼ 3.4 Mb and 2.7 Mb for Floria and Stainberry, respectively).Assemblies are not only larger but also have improved quality, with a lower number of misassemblies, especially for translocations (mean ¼ 1.4 versus 5.2 for Floria and Strainberry respectively, Fig. 2C).Floria's phasing is fast, with only 5.3 CPU hours being used to phase samples, and an additional 82 CPU hours to assemble haplosets with Flye, in comparison to 263 CPU hours with Strainberry (phasing and assembly) and 146 CPU hours with MetaFlye (Fig. 2D).Any assembler can be used downstream for Floria depending on the user's preference.Using WTDBG2 instead of Flye, we observed a significant reduction of duplication ratio and misassemblies at the cost of genome completeness, though genome completeness still remained higher than Strainberry (Supplementary Fig. S4).We found that as the abundance of strains becomes smaller, Floria's relative improvement in strain recovery to Strainberry becomes larger, with up to a 33% relative increase in strain genome recovery at 7× coverage (Fig. 2E).The difference in performance is more pronounced when a species has many strains, with Floria being able to phase and assemble species with three or more strains much more completely (þ19% absolute difference of strain genome aligned on average, Mann-Whitney U test P-value <3 × 10 -8 , Supplementary Fig. S6).We found that Strainberry's runtime also increases substantially as the number of strains increases, while Floria's runtime remains low (Fig. 2F).Applying Floria directly on a MetaFlye assembly (Fig. 1A, option 2), we observed a considerable improvement of the strain genomes recovery from the initial assembly (34% relative mean improvement, Supplementary Fig. S6), but with a decreased performance compared to the reference genomes based phasing.This approach remains useful for metagenomic samples without a comprehensive reference genome database available and is currently limited to Floria, as Strainberry is unable to directly phase a metagenomic assembly with a mixture of species of different strain counts.
We further investigated Floria's performance as a function of strain number, coverage, strain divergence, and realistic long-reads on a set of single-species communities described in detail in Supplementary Methods S8.For a 3-strain Klebsiella pneumoniae community with varying coverages, Floria recovers more strain content on average (89.6% versus 74.5% for Floria and Strainberry respectively) on almost all coverage configurations (Fig. 2G) while Strainberry struggles to assemble low-abundance strains when variation in coverage is high (Supplementary Fig. S7).For multiple 2-strain E. coli communities constructed with 7 E.coli strains of increasing divergence, Floria recovers a higher proportion of strain genomes, even for closely related strains compared to Strainberry (Supplementary Fig. S8).Next, we constructed synthetic communities using real reads from multiple isolates from 3 species: Bacillus licheniformis (PRJNA1029794), K. pneumoniae (PRJNA1033449), and E.coli (Greig et al. 2019).For each isolate, we first assembled high-quality ground truths for quality assessment.For all species, Floria showed improvement in genome completeness over Strainberry (Fig. 2H) with most of the strains being resolved at more than 80% completeness (mean of 88% and 75% genome aligned for Floria and Strainberry, respectively).We also compared against WhatsHap Polyphase (Schrinner et al. 2020), a polyploid phaser, by assembling its haplotypespecific reads and found Floria had þ10% absolute difference of strain genome aligned on average, and 7× fewer misassemblies than WhatsHap (Supplementary Fig. S9).In some cases, Floria is able to reconstruct almost the entire strain genome with relatively good accuracy (average ANI ¼ 99.5%) and a low number of misassemblies (average ¼ 4.6 structural variants), while MetaFlye and Strainberry generated either one strain or an incomplete mixture of all strains (Fig. 2I, Supplementary Fig. S10).

i34
Shaw and Gounot et al.

Exploring diversity with nanopore sequencing of 109 gut metagenomes using Floria
To explore Floria's utility for studying strain diversity within diverse microbial communities, we analyzed data from a population study of 109 gut microbiomes with deep nanopore sequencing (Gounot et al. 2022).On average, Floria-PL took <2 h per sample (<25 min for Floria phasing) using 8 CPUs (Fig. 3A).We analyzed >5000 sample-species pairs with sufficient coverage (>5× by default, Supplementary Fig. S10), yielding phasings for >50 species on average per sample (Fig. 3B).The strain number within a species-sample pair has been defined using Floria's strain count estimate with HAPQ superior or equal to 15, subsequently length-normalized across all contigs within a genome and rounded to an integer (Supplementary Methods S6, Fig. 3C).The ability to detect multiple strains for a species in the gut microbiome was correlated with the prevalence (the number of samples the species appears in) of the corresponding genome (Supplementary Fig. S11D).Correspondingly, more abundant species such as Blautia wexelerae A, Bifidobacterium adolescentis, and Faecalibacterium prausnitzii G tend to have a higher proportion of samples with multiple strains (>50%).
We visualize four phasings with high coverage: a 2-strain and a 3-strain E.coli community, a 2-strain B.adolescentis community, and a 2-strain Streptococcus salivarius community (Fig. 3D).For the three 2-strain communities, differential strain coverage confirms our phasings; these haplosets can in principle be binned to give near-complete haplosets.We also inspected the 3-strain E.coli community with the IGV (Robinson et al. 2011) (Supplementary Fig. S12) and found concordant haplosets for three strains.This shows Floria's potential for obtaining phasings and quickly confirming them without any assembly.

Short-read phasing enables tracking of multistrain dynamics in longitudinal samples
We conducted additional short-read phasing benchmarks in Supplementary Methods S10 and Supplementary Fig. S13.Briefly, we simulated short reads on the same 120-strain synthetic community as before and evaluated Floria with the downstream assembler Megahit (Li et al. 2015) against shortread assemblers Megahit and StrainXPress (Kang et al. 2022).On average, Floria recovered 72% of the strain content compared to 50% and 84% for Megahit and StrainXPress but was 2-6× faster and 8× more memory-efficient than StrainXpress.
Figure Assessment of phased assemblies for the 120-strain long-read synthetic metagenome (A-F) as well as single-species multi-strain communities (G-I).Flye was used as Floria's downstream assembler.(A) Percent aligned (i.e.covered by assemblies) for all strain genomes, (B) N50 (log10), (C) number of structural variants within the strain assemblies, and (D).CPU time spent with Floria, Strainberry and MetaFlye for a nanopore synthetic read community of 40 species with 120 strains.(E) Percent aligned of all 120 strain genomes as a function of strain coverage.The lower graph displays the average difference between the two methods.(F) Phasing and assembly runtime as a function of the number of strains for each species.(G) Phased assembly improvement (absolute difference) based on percentage aligned for Floria relative to Strainberry for a synthetic community with 3 K.pneumonia strains.The first strain has 15× coverage.(H) Phased assembly results for pooled, real nanopore reads from multiple strains, showing the aligned percentage of the assembled strains.(I) Circos plot (Krzywinski et al. 2009) of assemblies obtained from a synthetic 3-strain community of pooled, real nanopore reads for B.licheniformis.Bands represent segments of the true genomes with alignments from the assemblies, and red lines are SNPs.N413-11 (marked with an asterisk) was used as the reference genome for mapping and phasing.

Floria i35
Leveraging Floria's short-read phasing capabilities, we investigated a longitudinal set of 24 human gut short-read sequencing samples collected over 636 days from a healthy individual labeled "Donor A" from Watson et al (Watson et al. 2023).For these samples, we called variants with freebayes (with the-pooled-continuous option) on the combined BAM file and phased each sample on the combined VCF file.We then tracked the vartigs across samples with a reciprocal mapping procedure (Supplementary Methods S11).Floria took <2 h (20 threads) and 11 GB of RAM to haplotype all 24 samples.We focus on two species with multiple strains and high abundance across all samples, Faecalibacilus intestinalis and Anaerostipes hadrus (Fig. 4).
Faecalibacillus intestinalis is a recently isolated species (Seo et al. 2019) from the human gut with limited characterization.Exactly one major strain and one minor strain were observed across 636 days.To confirm that two strains were present, we visualized the haplosets output by Floria for the first time point in IGV (Supplementary Fig. S14).The remarkable stability observed here could reflect distinct niches for these strains (Ghoul and Mitri 2016), but other explanations are possible as stability and competition have a nuanced relationship (Butler and O'Dwyer 2018).
As an example of a species with lower stability and interesting gain/loss patterns, we tracked the dynamics of A.hadrus.This analysis highlighted the emergence of a low abundance strain that transitioned to become a high abundance strain between days 391 and 446.To confirm this emergence, we visualized Floria's outputs during day 391 and day 446 (Supplementary Fig. S15) in IGV, showing that lowcoverage haplosets on day 391 have identical alleles as the high-coverage haplosets on day 446.This emergent strain was not the same as the minor strain on day 0, as indicated by the high number of alternate alleles in the emergent strain, whereas the minor strain on day 0 contains mostly reference alleles (Fig. 4).There are thus at least 3 strains present over this timescale.Notably, the low-coverage strain on day 391 has <1/15th of the coverage of the major strain, demonstrating that even low-abundance strains can be detected and tracked with Floria.Finally, we observed that for both species, strains observed at other time points were lost only in the sample for day 614.This could be an indication of a substantial perturbation to the gut microbiome, but more likely it could be an indication of mislabeling for the day 614 sample.

Conclusion
The importance of strain-level analyses, combined with everincreasing amounts of sequencing data, implies that accurate and efficient methods for unraveling strains will be of interest for the foreseeable future.In response, we developed Floria, a metagenome haplotype phasing tool, along with an associated assembly pipeline that can rapidly extract microbial haplotypes for downstream analysis.Floria's phasing and assembly capabilities provide a path forward for large-scale analyses of diverse metagenomic data with even short and noisy long-reads.For future work, hybrid short and noisy long-read approaches for haplotyping to leverage the throughput and accuracy of short reads would be interesting.However, it should be noted that the accuracy of noisy long-reads is continuously improving.Another potential improvement to Floria is considering indels and structural variations in the phasing process, as currently only SNPs are used.

Figure 1 .
Figure1.Overview of the phasing pipeline, Floria-PL, and core phasing algorithm, Floria.(A) The Floria-PL pipeline first processes short or long reads by either metagenomic assembly or by identifying a set of reference genomes (default).Read mapping and variant calling are then performed.These outputs are then given to the Floria phasing algorithm.The phasing results can be optionally assembled to generate base-level assemblies.(B) The Floria phasing algorithm first uses SNPs and the minimum error correction (MEC) model to cluster the reads in local genomic segments into local strain clusters (steps 1-2).Red, green, and blue represent different true strains, with the blue and green strains having high local similarity.A directed acyclic graph (DAG) is constructed from clusters and a coverage-preserving network flow is found by linear programming (LP) to smooth out noise (steps 3-4).Floria iteratively finds optimal paths under a max-min flow condition by dynamic programming (DP) and outputs clusters of reads called haplosets (steps 5-6).The resulting haplosets contain reads with concordant SNPs but may correspond to multiple similar strains (blue and green).

i32
Shaw and Gounot et al.

P
e2vout f ðeÞ with v in and v out being the edges into and out of a node.This function is solvable by linear programming (LP).We first introduce auxiliary variables tðeÞ and the linear constraints tðeÞ ≥ f ðeÞ − wðeÞ and tðeÞ ≥ wðeÞ − f ðeÞ.It follows that tðeÞ ≥ jwðeÞ − f ðeÞj and minimizing tðeÞ corresponds to minimizing jwðeÞ − f ðeÞj.Then we just solve the LP min f X e2E tðeÞ subject to 8e 2 E; tðeÞ ≥ f ðeÞ − wðeÞ and tðeÞ ≥ wðeÞ − f ðeÞ along with the original flow constraints on f to give us the resulting network flow.We solve this LP using the minilp (Zatelepin 2022) package, an open-source LP solver.

Figure 3 .
Figure 3. Phasing of 109 gut nanopore metagenomes with Floria.(A) Runtime of each major component of the phasing pipeline for each sample.(B) Number of species that have been phased by Floria for all 109 SPMP samples.(C) Heatmap of the number of strains within the population for the 40 most abundant species based on short-read abundance.Species are sorted by prevalence (i.e. the no. of samples a species appears).(D) Coveragevartig plots with sample and species denoted.Each bar is a vartig (a contiguous block of phased SNPs) that spans positions shown in the x-axis and is colored by its fraction of alternate alleles.

Figure 4 .
Figure 4. Strain tracking of 24 longitudinal short-read samples from Donor A in Watson et al. (2023) for Faecalibacilus intestinalis and A.hadrus strains over 636 days.Each line represents a vartig-vartig correspondence colored by the percentage of alternate alleles relative to the reference for the paired vartigs.The density of vartigs is shown in black along each sampling date.The coverage is log2(x þ 1) transformed to stabilize variance (Ahlmann-Eltze and Huber 2023).Note the nonuniformity of the sampling dates.The emergence of a low-abundance strain is highlighted for A.hadrus.